Colombia COVID-19

LINK: https://www.kaggle.com/camesruiz/colombia-covid19-complete-dataset

DESCRIPTION: Coronavirus (COVID-19) made its outbreak in Colombia with the first confirmed in the country on march 6th, since then, number of confirmed cases has been increasing and deaths related to the virus are starting to have the first confirmed cases. This data set contains complete information about confirmed cases, deaths and number of recovered patients according to the daily reports by the colombian health department (Ministerio de Salud)

GOAL: Build a model for the number of confirmed cases in the different Colombia regions. You have the access to some covariates, such as: Edad (age), Sexo (Sex), Pais de procedencia (origin country) of the individual cases. Try to test the predictive accuracy of your selected model.

ATTENTION: Three countries are here considered: Colombia, Mexico and India. Each different group of students should focus on a geographical sub-area of one of the three countries, say the northern, the central or the southern part of the countries, by pooling all the regions/states/departments belonging to the considered area. Say: group A focuses on Northern Mexico, group B on Central Mexico, and so on. The distinction in northern, central and southern is not strict, you have some flexibility.


Our Project

We decided to do central Colombia because it is where the capital is.

The largest cities in the country are Bogotá (in the center), Medellín (in the north, close to central), Cali (in the center) and Barranquilla (extreme north).

Dataset - First Exploration

##  [1] "Bogotá D.C."           "Valle del Cauca"      
##  [3] "Antioquia"             "Cartagena D.T. y C"   
##  [5] "Huila"                 "Meta"                 
##  [7] "Risaralda"             "Norte de Santander"   
##  [9] "Caldas"                "Cundinamarca"         
## [11] "Barranquilla D.E."     "Santander"            
## [13] "Quindío"               "Tolima"               
## [15] "Cauca"                 "Santa Marta D.T. y C."
## [17] "Cesar"                 "San Andrés"           
## [19] "Casanare"              "Nariño"               
## [21] "Atlántico"             "Boyacá"               
## [23] "Córdoba"               "Bolívar"              
## [25] "Sucre"                 "La Guajira"
## # A tibble: 26 x 2
## # Groups:   Departamento o Distrito [26]
##    `Departamento o Distrito`     n
##    <chr>                     <int>
##  1 Antioquia                   127
##  2 Atlántico                     4
##  3 Barranquilla D.E.            31
##  4 Bogotá D.C.                 542
##  5 Bolívar                       3
##  6 Boyacá                        6
##  7 Caldas                       16
##  8 Cartagena D.T. y C           39
##  9 Casanare                      2
## 10 Cauca                        12
## # ... with 16 more rows

Colombia is divided into 32 departments. According to Wikipedia we miss the Departments of Amazonas, Arauca, Caquetá, Chocó, Guainía, Guaviare, Magdalena, Putumayo, Vaupés, Vichada.

Bogotá, Distrito Capital in in the Cundinamarca Department. Barranquilla D.E. is a “Distrito Especial” but should be in the Atlántico Department.

The Districts (Spanish: Distrito) in Colombia are cities that have a feature that highlights them, such as its location and trade, history or tourism. Arguably, the districts are special municipalities. The districts are Bogotá, Barranquilla, Cartagena, Santa Marta, Cúcuta, Popayán, Tunja, Buenaventura, Turbo and Tumaco.

We miss Cúcuta, Popayán, Tunjaa, Buenaventura, Turbo and Tumaco.

# lat-long bogota<c(4.592164298, -74.072166378, 542)
valle_de_cauca <- c("Valle del Cauca", 3.43722, -76.5225, 150)
cauca <- c("Cauca", 2.43823, -76.6131592, 12)
antioquia <- c("Antioquia", 6.25184, -75.56359, 127)
cartagena <- c("Cartagena D.T. y C", 10.39972, -75.51444, 39)
huila <- c("Huila", 2.9273, -75.2818909, 30)
meta <- c("Meta", 4.142, -73.62664, 12)
risaralda <- c("Risaralda", 4.8133302, -75.6961136, 35)
norte_santander <- c("Norte de Santander", 7.89391, -72.50782, 21)
caldas <- c("Caldas", 5.06889, -75.51738, 16)
cudinamarca <- c("Cundinamarca", 4.862437, -74.058655, 42)
barraquilla <- c("Barranquilla D.E.", 10.96854, -74.78132, 35)  #atlantico
santader <- c("Santander", 7.12539, -73.1198, 12)
quindio <- c("Quindío", 4.535, -75.67569, 23)
tolima <- c("Tolima", 4.43889, -75.2322235, 14)
santa_marta <- c("Santa Marta D.T. y C.", 11.24079, -74.19904, 12)
cesar <- c("Cesar", 10.46314, -73.25322, 16)
san_andres <- c("San Andrés", 12.5847197, -81.7005615, 2)
casanare <- c("Casanare", 5.33775, -72.39586, 2)
narino <- c("Nariño", 1.21361, -77.28111, 6)
boyaca <- c("Boyacá", 5.767222, -72.940651, 6)
cordoba <- c("Córdoba", 8.74798, -75.88143, 2)
bolivar <- c("Bolívar", 10.39972, -75.51444, 3)
sucre <- c("Sucre", 9.3047199, -75.3977814, 1)
guajira <- c("La Guajira", 11.54444, -72.90722, 1)

gis_data <- data.frame(name = "Bogotá D.C.", latitude = 4.624335, longitude = -74.063644, 
    cases = 542)  #bogotà
gis_data$name <- as.character(gis_data$name)
gis_data <- rbind(gis_data, cauca)
gis_data <- rbind(gis_data, valle_de_cauca)
gis_data <- rbind(gis_data, antioquia)
gis_data <- rbind(gis_data, cartagena)
gis_data <- rbind(gis_data, huila)
gis_data <- rbind(gis_data, meta)
gis_data <- rbind(gis_data, risaralda)
gis_data <- rbind(gis_data, norte_santander)
gis_data <- rbind(gis_data, caldas)
gis_data <- rbind(gis_data, cudinamarca)
gis_data <- rbind(gis_data, barraquilla)
gis_data <- rbind(gis_data, santader)
gis_data <- rbind(gis_data, quindio)
gis_data <- rbind(gis_data, tolima)
gis_data <- rbind(gis_data, santa_marta)
gis_data <- rbind(gis_data, cesar)
gis_data <- rbind(gis_data, san_andres)
gis_data <- rbind(gis_data, casanare)
gis_data <- rbind(gis_data, narino)
gis_data <- rbind(gis_data, boyaca)
gis_data <- rbind(gis_data, cordoba)
gis_data <- rbind(gis_data, bolivar)
gis_data <- rbind(gis_data, sucre)
gis_data <- rbind(gis_data, guajira)

gis_data$latitude <- as.numeric(gis_data$latitude)
gis_data$longitude <- as.numeric(gis_data$longitude)
gis_data$cases <- as.numeric(gis_data$cases)

The color of the pins is related with the number of cases: if they are less than 10 the color is “green”, if they are less than 100 the color is “orange”, otherwise it is “red”.
On the map there are all the cities/departments for which we have data. We can notice that we don’t have any data in the south of the country.

Reading here and there I found that Colombia in divided in 5 regions, the central one comprises: Boyacá, Tolima, Cundinamarca, Meta, Bogotà DC.

ANGELA: Seeing Wikipedia I think that the Orinoquía Region (Meta, Arauca, Casanare and Vichada Departments) is in the center, so I would add also Arauca, Casanare and Vichada. I noticed that we only have Casanare, the other two doesn’t have data.

GAIA: added Casanare, for the others we have no data!

However, since in our assignment Colombia is divided in 3 parts, I think that we should add some more regions (e.g. Quindío, Valle del Cauca, Risaralda, Celdas, Boyacá and possibly Antioquia and Santander)

Some very basics plots

Let’s check the situation (and also the power of ggplot)!

Scattered infos about pandemic in Colombia (https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Colombia):

  • the quarantine started on the 20th of March, since our data are from 6th of March to 2nd of April, it is very likeliy that quarantine effects are not witnessed in our data.

  • on March the 26th there was a damage in the machine that prepared the samples for processing and subsequent diagnosis of COVID-19, which affected the speed at which results were being produced. This could explain the very low number of confirmed cases.


The major number of cases are in the capital Bogotà.


The previous plot represents the daily incidence of the desease across all the departments we are taking into account.

Let’s check the general trend by looking at the cumulative number of confirmed cases (again, all “our” departments are taken into account):

##    Fecha de diagnóstico Cumulative confirmed
## 1            2020-03-06                    1
## 2            2020-03-09                    3
## 3            2020-03-11                    9
## 4            2020-03-12                   11
## 5            2020-03-13                   16
## 6            2020-03-14                   24
## 7            2020-03-15                   45
## 8            2020-03-16                   57
## 9            2020-03-17                   75
## 10           2020-03-18                  102
## 11           2020-03-19                  128
## 12           2020-03-20                  175
## 13           2020-03-21                  210
## 14           2020-03-22                  240
## 15           2020-03-23                  306
## 16           2020-03-24                  419
## 17           2020-03-25                  481
## 18           2020-03-26                  491
## 19           2020-03-27                  539
## 20           2020-03-28                  603
## 21           2020-03-29                  700
## 22           2020-03-30                  798
## 23           2020-03-31                  906
## 24           2020-04-01                 1065
## 25           2020-04-02                 1161

Here the growth seems exponential (and this is consistent with the fact that we are studying the early stages of the outbreak).

In order to confirm it we should fit a log-linear model, and check that it produces a constant growth rate (straight line).

Now let’s explore the distribution of cases across genders and age:


Maybe in order to study the distribution of ages we should divide the ages in groups, for example 0-18, 18-30, 30-45, 45-60, 60-75, 75+.


This is quite surprising.. I expected elder people to be more affected by Covid-19!

The overall life expectancy in Colombia at birth is 74.8 years (71.2 years for males and 78.4 years for females). Wikipedia

Instead, the median age of the population in 2015 was 29.5 years (30.4 in 2018, 31.3 in 2020), so it is a Country full of young people! link or link or link

Now we can analyze jointly the distribution of age and sex (sex distribution across group of age):


There isn’t much difference between the sexes among the different group of ages, I have the impression that the covariates present in the dataset won’t help us!! :(

We are now left to explore the Tipo variable:


I think that en estudio means that it is not clear while the case is imported or not, however it seems like there are more imported cases, we can count them:

## # A tibble: 3 x 3
##   tipo        total_number percentage
##   <chr>              <int> <chr>     
## 1 Relacionado          291 29.3%     
## 2 Importado            467 47.0%     
## 3 En estudio           235 23.7%

Almost half of the total confirmed cases in our region of interest are imported, and a significant percentage is anknown wheter it is imported or not. Again this is in some sense interesting, but I don’t see clearly why this should be helpful in our model!

## # A tibble: 55 x 2
## # Groups:   País de procedencia [55]
##    `País de procedencia`     n
##    <chr>                 <int>
##  1 0                         1
##  2 Alemania                  4
##  3 Alemania - Estambul       1
##  4 Arabia                    1
##  5 Argentina                 2
##  6 Aruba                     2
##  7 Bélgica                   1
##  8 Brasil                   10
##  9 Canadá                    1
## 10 Chile                     2
## # ... with 45 more rows

here data are a bit dirty, however I don’t know if the effort of cleansing them will worth the result.. it depends wheter we decide to use this info in our analysis.

Analyze the epidemic curve separately for each department, considering only those that have more than 30 cases:

## # A tibble: 83 x 3
##    date       dep                 n
##    <date>     <chr>           <int>
##  1 2020-03-06 Bogotá D.C.         1
##  2 2020-03-09 Antioquia           1
##  3 2020-03-09 Valle del Cauca     1
##  4 2020-03-11 Antioquia           3
##  5 2020-03-11 Bogotá D.C.         2
##  6 2020-03-12 Bogotá D.C.         2
##  7 2020-03-13 Bogotá D.C.         1
##  8 2020-03-13 Valle del Cauca     1
##  9 2020-03-14 Antioquia           3
## 10 2020-03-14 Bogotá D.C.         5
## # ... with 73 more rows

Analyze the curve of cumulative confirmed cases on those “relevant” department:


We can see that (except for the Bogotà department) we have a lot of “missing” columns, these are the days in which no data was recorded for the corresponding department, the cumulative number of cases is the same as the previous day reported in the dataset. Maybe there is a way to fix this!

##          date             dep  n cumulative_dep elapsed_time
## 1  2020-03-06     Bogotá D.C.  1              1            0
## 2  2020-03-09       Antioquia  1              1            3
## 3  2020-03-09 Valle del Cauca  1              1            3
## 4  2020-03-11       Antioquia  3              4            5
## 5  2020-03-11     Bogotá D.C.  2              3            5
## 6  2020-03-12     Bogotá D.C.  2              5            6
## 7  2020-03-13     Bogotá D.C.  1              6            7
## 8  2020-03-13 Valle del Cauca  1              2            7
## 9  2020-03-14       Antioquia  3              7            8
## 10 2020-03-14     Bogotá D.C.  5             11            8
## 11 2020-03-15       Antioquia  1              8            9
## 12 2020-03-15     Bogotá D.C.  8             19            9
## 13 2020-03-15    Cundinamarca  1              1            9
## 14 2020-03-15       Risaralda  1              1            9
## 15 2020-03-15 Valle del Cauca  1              3            9
## 16 2020-03-16     Bogotá D.C. 11             30           10
## 17 2020-03-16    Cundinamarca  1              2           10
## 18 2020-03-17     Bogotá D.C.  9             39           11
## 19 2020-03-17 Valle del Cauca  2              5           11
## 20 2020-03-18     Bogotá D.C.  5             44           12
## 21 2020-03-18    Cundinamarca  2              4           12
## 22 2020-03-18       Risaralda  4              5           12
## 23 2020-03-18 Valle del Cauca  8             13           12
## 24 2020-03-19       Antioquia  3             11           13
## 25 2020-03-19     Bogotá D.C.  8             52           13
## 26 2020-03-19    Cundinamarca  1              5           13
## 27 2020-03-19 Valle del Cauca  1             14           13
## 28 2020-03-20       Antioquia 11             22           14
## 29 2020-03-20     Bogotá D.C. 29             81           14
## 30 2020-03-20    Cundinamarca  3              8           14
## 31 2020-03-20       Risaralda  1              6           14
## 32 2020-03-20 Valle del Cauca  1             15           14
## 33 2020-03-21       Antioquia  3             25           15
## 34 2020-03-21     Bogotá D.C.  6             87           15
## 35 2020-03-21    Cundinamarca  1              9           15
## 36 2020-03-21       Risaralda  2              8           15
## 37 2020-03-21 Valle del Cauca 11             26           15
## 38 2020-03-22       Antioquia  5             30           16
## 39 2020-03-22     Bogotá D.C.  4             91           16
## 40 2020-03-22       Risaralda  5             13           16
## 41 2020-03-22 Valle del Cauca  5             31           16
## 42 2020-03-23       Antioquia 22             52           17
## 43 2020-03-23     Bogotá D.C. 22            113           17
## 44 2020-03-23    Cundinamarca  5             14           17
## 45 2020-03-23       Risaralda  1             14           17
## 46 2020-03-23 Valle del Cauca  6             37           17
## 47 2020-03-24     Bogotá D.C. 48            161           18
## 48 2020-03-24    Cundinamarca  7             21           18
## 49 2020-03-24       Risaralda  3             17           18
## 50 2020-03-24 Valle del Cauca 29             66           18
## 51 2020-03-25       Antioquia  8             60           19
## 52 2020-03-25     Bogotá D.C. 16            177           19
## 53 2020-03-25    Cundinamarca  1             22           19
## 54 2020-03-25       Risaralda  2             19           19
## 55 2020-03-25 Valle del Cauca  5             71           19
## 56 2020-03-26     Bogotá D.C.  7            184           20
## 57 2020-03-26 Valle del Cauca  2             73           20
## 58 2020-03-27     Bogotá D.C. 38            222           21
## 59 2020-03-27    Cundinamarca  1             23           21
## 60 2020-03-28       Antioquia  7             67           22
## 61 2020-03-28     Bogotá D.C. 38            260           22
## 62 2020-03-28    Cundinamarca  2             25           22
## 63 2020-03-28 Valle del Cauca 11             84           22
## 64 2020-03-29       Antioquia 19             86           23
## 65 2020-03-29     Bogotá D.C. 33            293           23
## 66 2020-03-29       Risaralda 10             29           23
## 67 2020-03-29 Valle del Cauca  8             92           23
## 68 2020-03-30       Antioquia 10             96           24
## 69 2020-03-30     Bogotá D.C. 56            349           24
## 70 2020-03-30    Cundinamarca  4             29           24
## 71 2020-03-30 Valle del Cauca 13            105           24
## 72 2020-03-31       Antioquia  5            101           25
## 73 2020-03-31     Bogotá D.C. 41            390           25
## 74 2020-03-31    Cundinamarca  9             38           25
## 75 2020-03-31       Risaralda  6             35           25
## 76 2020-03-31 Valle del Cauca 11            116           25
## 77 2020-04-01       Antioquia  6            107           26
## 78 2020-04-01     Bogotá D.C. 82            472           26
## 79 2020-04-01    Cundinamarca  4             42           26
## 80 2020-04-01 Valle del Cauca 31            147           26
## 81 2020-04-02       Antioquia 20            127           27
## 82 2020-04-02     Bogotá D.C. 70            542           27
## 83 2020-04-02 Valle del Cauca  3            150           27

Missing

I still didn’t integrate the “other part” of the dataset, the one concerning deaths!

Ideas

For what concerns the predictive model we want to build, I think that we should start by something very simple (e.g. a (log)linear model) and take it as a baseline.

Then we build something more complex (such as a hierarchical model) and see the improvements with respect to the baseline.

If possible I would put inside something bayesian, since I understood that they really like this kind of stuff!

Gaia I start here with my stream of consciousness about the analysis!

Let’s fix some terminology (references for these are the first two links of the section “Interesting Links”):

  • a case is a person who tested positive for Covid-19;
  • the epidemic curve represents the daily incremental incidence;

Our data are tabulated by date of confirmation by lab test.

The variable we want to predict, say y, (dependent variable) is the (cumulative) number of confirmed cases, namely we want to simulate a (hopefully plausible) epidemic trajectory. Eventually we will project future incidence.

Since our y is a discrete count variable, a linear regression is not appropriate.

The most common choice with count data is to use Poisson regression, which belongs to the GLM class of models.

\[ ln\lambda_i = \beta_0 + \beta_1x_i \\ y_i \sim \mathcal{Poisson}(\lambda_i) \]

I think that, as first step, we should consider the most parsimonious model, taking only the time as independent variable. Here time can be intended both as the Fecha de diagnostico attribute of our dataset, or as the number of days elapsed from the earlier day in the dataset (in this case we should derive this attribute).

Further steps will include more covariates, and a model selection phase should follow. At the moment we are basically working with time series data.

Important: recall that modelling count data with a Poisson regression we are assuming equidispersion (the mean coincides with the variance), however we have no guarantees that this is true for our data, we need to take this into account!

The days 7/03/20, 8/03/20, 10/03/20 are missing, probably because no case was detected in those days (consistent with the fact that it was the very beginning of the outbreak).

##          date    y elapsed_time
## 1  2020-03-06    1            0
## 2  2020-03-09    3            3
## 3  2020-03-11    9            5
## 4  2020-03-12   11            6
## 5  2020-03-13   16            7
## 6  2020-03-14   24            8
## 7  2020-03-15   45            9
## 8  2020-03-16   57           10
## 9  2020-03-17   75           11
## 10 2020-03-18  102           12
## 11 2020-03-19  128           13
## 12 2020-03-20  175           14
## 13 2020-03-21  210           15
## 14 2020-03-22  240           16
## 15 2020-03-23  306           17
## 16 2020-03-24  419           18
## 17 2020-03-25  481           19
## 18 2020-03-26  491           20
## 19 2020-03-27  539           21
## 20 2020-03-28  603           22
## 21 2020-03-29  700           23
## 22 2020-03-30  798           24
## 23 2020-03-31  906           25
## 24 2020-04-01 1065           26
## 25 2020-04-02 1161           27

Country of origin

##  [1] "0"                                              
##  [2] "Alemania"                                       
##  [3] "Alemania - Estambul"                            
##  [4] "Arabia"                                         
##  [5] "Argentina"                                      
##  [6] "Aruba"                                          
##  [7] "Bélgica"                                        
##  [8] "Brasil"                                         
##  [9] "Canadá"                                         
## [10] "Chile"                                          
## [11] "Colombia"                                       
## [12] "Costa Rica"                                     
## [13] "Croacia"                                        
## [14] "Cuba"                                           
## [15] "Ecuador"                                        
## [16] "Egipto"                                         
## [17] "Emiratos Árabes"                                
## [18] "España"                                         
## [19] "España - Croacia"                               
## [20] "España - Croacia - Bosnia"                      
## [21] "España - Egipto"                                
## [22] "España - Francia"                               
## [23] "España - India"                                 
## [24] "España - Italia"                                
## [25] "España - Turquia"                               
## [26] "Estados Unidos"                                 
## [27] "Europa"                                         
## [28] "Francia"                                        
## [29] "Francia - Holanda"                              
## [30] "Grecia"                                         
## [31] "Guatemala"                                      
## [32] "Inglaterra"                                     
## [33] "Isla Martin - Caribe"                           
## [34] "Islas San Martin"                               
## [35] "Israel Egipto"                                  
## [36] "Italia"                                         
## [37] "Italia - España - Turquía"                      
## [38] "Italia - Jamaica - Panamá"                      
## [39] "Italia - Ucrania - España"                      
## [40] "Jamaica"                                        
## [41] "Jamaica - Isla Caimán - Panamá"                 
## [42] "Jamaica - Panamá - Isla Caimán"                 
## [43] "Jamaica - Panamá - Islas del caribe - Cartagena"
## [44] "Londres"                                        
## [45] "Madrid"                                         
## [46] "Marruecos"                                      
## [47] "México"                                         
## [48] "Panamá"                                         
## [49] "Panamá - Jamaica"                               
## [50] "Perú"                                           
## [51] "Puerto Rico"                                    
## [52] "República Dominicana"                           
## [53] "Suiza"                                          
## [54] "Turquía"                                        
## [55] "Turquía - Grecia"                               
## [56] "Venezuela"
## [1] 911
## # A tibble: 1 x 9
##   `ID de caso` `Fecha de diagn~ `Ciudad de ubic~ `Departamento o~
##          <dbl> <date>           <chr>            <chr>           
## 1         1077 2020-04-02       Medellín         Antioquia       
## # ... with 5 more variables: `Atención**` <chr>, Edad <dbl>, Sexo <chr>,
## #   tipo <chr>, `País de procedencia` <chr>
# Transforming it into a null value
library(naniar)
covid19 <- colombia_covid %>% replace_with_na(replace = list(`País de procedencia` = 0))

# Standardizing the country names
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Isla Martin - Caribe"] <- "Islas San Martin"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Israel Egipto"] <- "Israel - Egipto"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Jamaica - Isla Caimán - Panamá"] <- "Jamaica - Panamá - Isla Caimán"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Madrid"] <- "España"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Londres"] <- "Inglaterra"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Alemania - Estambul"] <- "Alemania - Turquía"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` == 
    "Jamaica - Panamá - Islas del caribe - Cartagena"] <- "Jamaica - Panamá - Colombia"

# Creating continents
Europa <- c("Alemania", "Bélgica", "Europa", "Croacia", "España", "España - Croacia", 
    "España - Croacia - Bosnia", "España - Francia", "España - Italia", "Francia", 
    "Francia - Holanda", "Grecia", "Inglaterra", "Italia", "Italia - Ucrania - España", 
    "Suiza")
Asia <- c("Arabia", "Emiratos Árabes", "Turquía")
África <- c("Egipto", "Marruecos")
Norteamérica <- c("Canadá", "Estados Unidos", "México")
Centroamérica <- c("Costa Rica", "Cuba", "Guatemala", "Islas San Martin", "Jamaica", 
    "Jamaica - Panamá - Isla Caimán", "Jamaica - Panamá - Islas del caribe - Cartagena", 
    "Panamá", "Panamá - Jamaica", "Puerto Rico", "República Dominicana", "Aruba")
Sudamerica <- c("Argentina", "Brasil", "Chile", "Ecuador", "Perú", "Venezuela")
# Alemania - Turquía', 'España - India', 'España - Turquia', 'Italia -
# España - Turquía', 'Turquía - Grecia' -> 'Europa - Asia' 'España - Egipto'
# -> 'Europa - África' 'Israel - Egipto' -> 'Asia - África' 'Italia -
# Jamaica - Panamá' -> 'Europa - Centroamérica' 'Colombia' -> 'Colombia'

for (i in 1:nrow(colombia_covid)) {
    if (colombia_covid$`País de procedencia`[i] %in% Europa) {
        colombia_covid$`Continente de procedencia`[i] <- "Europa"
    } else if (colombia_covid$`País de procedencia`[i] %in% Asia) {
        colombia_covid$`Continente de procedencia`[i] <- "Asia"
    } else if (colombia_covid$`País de procedencia`[i] %in% África) {
        colombia_covid$`Continente de procedencia`[i] <- "África"
    } else if (colombia_covid$`País de procedencia`[i] %in% Norteamérica) {
        colombia_covid$`Continente de procedencia`[i] <- "Norteamérica"
    } else if (colombia_covid$`País de procedencia`[i] %in% Centroamérica) {
        colombia_covid$`Continente de procedencia`[i] <- "Centroamérica"
    } else if (colombia_covid$`País de procedencia`[i] %in% Sudamerica) {
        colombia_covid$`Continente de procedencia`[i] <- "Sudamerica"
    } else if (colombia_covid$`País de procedencia`[i] == "Colombia") {
        colombia_covid$`Continente de procedencia`[i] <- "Colombia"
    } else if (colombia_covid$`País de procedencia`[i] == "Alemania - Turquía") {
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Asia"
    } else if (colombia_covid$`País de procedencia`[i] == "España - India") 
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Asia" else if (colombia_covid$`País de procedencia`[i] == "España - Turquia") 
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Asia" else if (colombia_covid$`País de procedencia`[i] == "Italia - España - Turquía") 
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Asia" else if (colombia_covid$`País de procedencia`[i] == "Turquía - Grecia") 
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Asia" else if (colombia_covid$`País de procedencia`[i] == "España - Egipto") 
        colombia_covid$`Continente de procedencia`[i] <- "Europa - África" else if (colombia_covid$`País de procedencia`[i] == "Israel - Egipto") 
        colombia_covid$`Continente de procedencia`[i] <- "Asia - África" else if (colombia_covid$`País de procedencia`[i] == "Italia - Jamaica - Panamá") 
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Centroamérica"
}

Preparation of the dataset to run the Poisson regression

Giullia. So, as we have to group the data by date to be able to run the models, we have to count the occurance of the categorical variables in each day. So that’s what I did below and described each step in detail.

Running Poisson

Running poisson with just the variable representing the time as predictor

## 
## Call:
## glm(formula = y ~ elapsed_time, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.4517  -4.6708  -0.6678   1.8257   6.6308  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.680213   0.048083   55.74   <2e-16 ***
## elapsed_time 0.167514   0.002137   78.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 9178.66  on 24  degrees of freedom
## Residual deviance:  335.87  on 23  degrees of freedom
## AIC: 506.28
## 
## Number of Fisher Scoring iterations: 4

Results show the variable is very significant.


According to the plot there is presence of overdispersion because the standardized residuals are outside the -1 to 1 range assumed by a Poisson distribution. In addition, the residuals are also distributed according to a Poisson. This is surprising and I don’t know what does it mean. I have to research.

## [1] 12.91942

Extremely high!!!

## 
## Call:
## glm(formula = y ~ elapsed_time, family = quasipoisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.4517  -4.6708  -0.6678   1.8257   6.6308  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.680213   0.172829   15.51 1.14e-13 ***
## elapsed_time 0.167514   0.007681   21.81  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 12.9196)
## 
##     Null deviance: 9178.66  on 24  degrees of freedom
## Residual deviance:  335.87  on 23  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

## [1] 0.999986

So I applied the quasi poisson and the overdispersion reduced to 1. Anyway, the residuals are still behaving like a distribution. Have to understand why is that.

Running poisson with the variable representing time plus gender

## 
## Call:
## glm(formula = y ~ elapsed_time + Sexo_M, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.4002  -4.6051  -0.6089   1.8922   7.1625  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.657545   0.052500  50.620   <2e-16 ***
## elapsed_time  0.170628   0.003574  47.737   <2e-16 ***
## Sexo_M       -0.001276   0.001172  -1.089    0.276    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 9178.66  on 24  degrees of freedom
## Residual deviance:  334.69  on 22  degrees of freedom
## AIC: 507.1
## 
## Number of Fisher Scoring iterations: 4

Results show that the variable gender is not signficant and thus doesn’t change the performance of the model. We were already expecting this as in the data visualization step the gender is equaly distributed across the confirmed cases.

Running poisson with the variable representing time plus the age variables

## 
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + 
##     `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`, family = poisson)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.238  -2.569  -1.176   1.542   5.078  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.607872   0.055803  46.734  < 2e-16 ***
## elapsed_time    0.167161   0.003669  45.558  < 2e-16 ***
## `Edad_19 a 30` -0.017274   0.004794  -3.603 0.000314 ***
## `Edad_31 a 45`  0.011839   0.002400   4.933 8.11e-07 ***
## `Edad_46 a 60`  0.005595   0.003708   1.509 0.131349    
## `Edad_60 a 75` -0.029394   0.004861  -6.047 1.47e-09 ***
## `Edad_76+`      0.141382   0.017146   8.246  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 9178.66  on 24  degrees of freedom
## Residual deviance:  251.65  on 18  degrees of freedom
## AIC: 432.06
## 
## Number of Fisher Scoring iterations: 4

Results show that 4 out of the 5 age ranges are significant predictors, showing that age does have an effect in predicting the number of confirmed cases. Furthermore, it can be seen that the AIC reduced in comparison with the model that has just the variable time as predictor (poisson1).


## [1] 12.25888

The overdispersion is high and again the residuals continue to behave like a distribution.

## 
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + 
##     `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`, family = quasipoisson)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.238  -2.569  -1.176   1.542   5.078  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.607872   0.195383  13.347 8.93e-11 ***
## elapsed_time    0.167161   0.012847  13.012 1.36e-10 ***
## `Edad_19 a 30` -0.017274   0.016785  -1.029   0.3170    
## `Edad_31 a 45`  0.011839   0.008403   1.409   0.1759    
## `Edad_46 a 60`  0.005595   0.012983   0.431   0.6716    
## `Edad_60 a 75` -0.029394   0.017019  -1.727   0.1013    
## `Edad_76+`      0.141382   0.060034   2.355   0.0301 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 12.25915)
## 
##     Null deviance: 9178.66  on 24  degrees of freedom
## Residual deviance:  251.65  on 18  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

When we apply the quasi poisson model the age variables become not signficant anymore. But we know they are. So, I think is not a problem.


## [1] 0.9999786

By applying the quasi poisson the overdispersion reduced to 1, and again the residuals are still behaving like a distribution.

Running poisson with the variable representing time, age and departments as predictors

## 
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + 
##     `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + 
##     `Departamento o Distrito_Valle del Cauca`, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1801  -0.1712   0.0206   0.2054   1.2347  
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                                0.87340    0.17266   5.059
## elapsed_time                               0.28936    0.01801  16.067
## `Edad_19 a 30`                             0.06079    0.02229   2.727
## `Edad_31 a 45`                             0.09157    0.02522   3.632
## `Edad_46 a 60`                             0.04203    0.03211   1.309
## `Edad_60 a 75`                             0.26977    0.13730   1.965
## `Edad_76+`                                -0.53785    0.39458  -1.363
## `Departamento o Distrito_Bogotá D.C.`     -0.12844    0.03583  -3.584
## `Departamento o Distrito_Boyacá`           0.50113    0.66081   0.758
## `Departamento o Distrito_Caldas`           0.58769    0.24182   2.430
## `Departamento o Distrito_Casanare`        -2.87494    1.64029  -1.753
## `Departamento o Distrito_Cauca`           -0.56323    0.32851  -1.715
## `Departamento o Distrito_Cundinamarca`     0.17663    0.03995   4.421
## `Departamento o Distrito_Meta`            -0.26061    0.11210  -2.325
## `Departamento o Distrito_Quindío`          0.35966    0.33542   1.072
## `Departamento o Distrito_Risaralda`       -0.48417    0.16977  -2.852
## `Departamento o Distrito_Santander`        0.52315    0.16763   3.121
## `Departamento o Distrito_Tolima`           0.82324    0.23550   3.496
## `Departamento o Distrito_Valle del Cauca` -0.16446    0.05116  -3.215
##                                           Pr(>|z|)    
## (Intercept)                               4.22e-07 ***
## elapsed_time                               < 2e-16 ***
## `Edad_19 a 30`                            0.006382 ** 
## `Edad_31 a 45`                            0.000282 ***
## `Edad_46 a 60`                            0.190512    
## `Edad_60 a 75`                            0.049443 *  
## `Edad_76+`                                0.172847    
## `Departamento o Distrito_Bogotá D.C.`     0.000338 ***
## `Departamento o Distrito_Boyacá`          0.448236    
## `Departamento o Distrito_Caldas`          0.015086 *  
## `Departamento o Distrito_Casanare`        0.079653 .  
## `Departamento o Distrito_Cauca`           0.086432 .  
## `Departamento o Distrito_Cundinamarca`    9.81e-06 ***
## `Departamento o Distrito_Meta`            0.020087 *  
## `Departamento o Distrito_Quindío`         0.283594    
## `Departamento o Distrito_Risaralda`       0.004345 ** 
## `Departamento o Distrito_Santander`       0.001803 ** 
## `Departamento o Distrito_Tolima`          0.000473 ***
## `Departamento o Distrito_Valle del Cauca` 0.001305 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 9178.6623  on 24  degrees of freedom
## Residual deviance:    5.8948  on  6  degrees of freedom
## AIC: 210.3
## 
## Number of Fisher Scoring iterations: 4

The results show that by including the variables representing departments the AIC reduces by half in comparison with the previous model that included just time and age as predictors. In addition, 8 out of 12 of the dummy variables representing the departments are significant.


## [1] 0.9232465

Now the residuals are behaving differently and the overdispersion reduced significantly and it is inside the range assumed by a Poisson model, that is from -1 to 1. Therefore, there is no need to apply a quasi poisson model because the assumptions of the Poisson distribution are holding.

Running poisson with the variable representing time, age, departments and continent of origin as predictors

## 
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + 
##     `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + 
##     `Departamento o Distrito_Valle del Cauca` + `Continente de procedencia_Asia` + 
##     `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` + 
##     `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` + 
##     `Continente de procedencia_Sudamerica`, family = poisson)
## 
## Deviance Residuals: 
##  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## [24]  0  0
## 
## Coefficients:
##                                           Estimate Std. Error z value
## (Intercept)                               -0.51009    1.68993  -0.302
## elapsed_time                               0.36021    0.08730   4.126
## `Edad_19 a 30`                             0.59286    1.27410   0.465
## `Edad_31 a 45`                             0.45615    0.72066   0.633
## `Edad_46 a 60`                             0.51685    1.28398   0.403
## `Edad_60 a 75`                             0.23141    1.59678   0.145
## `Edad_76+`                                 1.90354    4.64431   0.410
## `Departamento o Distrito_Bogotá D.C.`      0.08291    0.64669   0.128
## `Departamento o Distrito_Boyacá`          -2.47626    5.45722  -0.454
## `Departamento o Distrito_Caldas`          -3.60049    7.92550  -0.454
## `Departamento o Distrito_Casanare`         9.75029   25.41981   0.384
## `Departamento o Distrito_Cauca`            2.17874    5.68979   0.383
## `Departamento o Distrito_Cundinamarca`    -0.32088    1.04486  -0.307
## `Departamento o Distrito_Meta`            -0.05219    0.58172  -0.090
## `Departamento o Distrito_Quindío`         -1.47873    3.45139  -0.428
## `Departamento o Distrito_Risaralda`        0.56632    2.61766   0.216
## `Departamento o Distrito_Santander`        0.11161    3.50792   0.032
## `Departamento o Distrito_Tolima`           0.34553    3.65577   0.095
## `Departamento o Distrito_Valle del Cauca` -0.11354    0.60147  -0.189
## `Continente de procedencia_Asia`          -1.61027    2.77951  -0.579
## `Continente de procedencia_Centroamérica` -0.18590    0.55797  -0.333
## `Continente de procedencia_Colombia`      -0.60549    1.61235  -0.376
## `Continente de procedencia_Europa`        -0.16569    1.34195  -0.123
## `Continente de procedencia_Norteamérica`  -0.18888    0.52559  -0.359
## `Continente de procedencia_Sudamerica`    -0.97691    1.21679  -0.803
##                                           Pr(>|z|)    
## (Intercept)                                  0.763    
## elapsed_time                              3.69e-05 ***
## `Edad_19 a 30`                               0.642    
## `Edad_31 a 45`                               0.527    
## `Edad_46 a 60`                               0.687    
## `Edad_60 a 75`                               0.885    
## `Edad_76+`                                   0.682    
## `Departamento o Distrito_Bogotá D.C.`        0.898    
## `Departamento o Distrito_Boyacá`             0.650    
## `Departamento o Distrito_Caldas`             0.650    
## `Departamento o Distrito_Casanare`           0.701    
## `Departamento o Distrito_Cauca`              0.702    
## `Departamento o Distrito_Cundinamarca`       0.759    
## `Departamento o Distrito_Meta`               0.929    
## `Departamento o Distrito_Quindío`            0.668    
## `Departamento o Distrito_Risaralda`          0.829    
## `Departamento o Distrito_Santander`          0.975    
## `Departamento o Distrito_Tolima`             0.925    
## `Departamento o Distrito_Valle del Cauca`    0.850    
## `Continente de procedencia_Asia`             0.562    
## `Continente de procedencia_Centroamérica`    0.739    
## `Continente de procedencia_Colombia`         0.707    
## `Continente de procedencia_Europa`           0.902    
## `Continente de procedencia_Norteamérica`     0.719    
## `Continente de procedencia_Sudamerica`       0.422    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 9.1787e+03  on 24  degrees of freedom
## Residual deviance: 5.7288e-14  on  0  degrees of freedom
## AIC: 216.41
## 
## Number of Fisher Scoring iterations: 3

The AIC actually increased by adding the continent of origin variables and none of them are significant.

## 
## Call:
## glm(formula = y ~ elapsed_time + `Continente de procedencia_Asia` + 
##     `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` + 
##     `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` + 
##     `Continente de procedencia_Sudamerica`, family = poisson)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -4.718  -3.863  -1.109   1.954   7.110  
## 
## Coefficients:
##                                             Estimate Std. Error z value
## (Intercept)                                2.4260742  0.0714763  33.942
## elapsed_time                               0.1762243  0.0036602  48.146
## `Continente de procedencia_Asia`          -0.0366909  0.0091383  -4.015
## `Continente de procedencia_Centroamérica` -0.0062672  0.0060027  -1.044
## `Continente de procedencia_Colombia`       0.0002061  0.0009566   0.216
## `Continente de procedencia_Europa`         0.0151845  0.0050123   3.029
## `Continente de procedencia_Norteamérica`  -0.0070750  0.0047855  -1.478
## `Continente de procedencia_Sudamerica`     0.0163521  0.0104428   1.566
##                                           Pr(>|z|)    
## (Intercept)                                < 2e-16 ***
## elapsed_time                               < 2e-16 ***
## `Continente de procedencia_Asia`          5.94e-05 ***
## `Continente de procedencia_Centroamérica`  0.29645    
## `Continente de procedencia_Colombia`       0.82937    
## `Continente de procedencia_Europa`         0.00245 ** 
## `Continente de procedencia_Norteamérica`   0.13929    
## `Continente de procedencia_Sudamerica`     0.11738    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 9178.66  on 24  degrees of freedom
## Residual deviance:  265.58  on 17  degrees of freedom
## AIC: 447.99
## 
## Number of Fisher Scoring iterations: 4

In this case the AIC reduced in comparison with the model with just the time as predictor(poisson1), but just 2 out of 6 variables are significant, so we think is not a very relevant predictor to include in the model with the lowest AIC so far (poisson4).

Running Negative Binomial

Running negative binomial with just the variable representing the time as predictor

## 
## Call:
## glm.nb(formula = y ~ elapsed_time, init.theta = 9.13533603, link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.00519  -1.13044  -0.09778   0.89852   1.23492  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.66558    0.17828   9.342   <2e-16 ***
## elapsed_time  0.22176    0.01017  21.805   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(9.1353) family taken to be 1)
## 
##     Null deviance: 402.79  on 24  degrees of freedom
## Residual deviance:  28.65  on 23  degrees of freedom
## AIC: 273.01
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  9.14 
##           Std. Err.:  3.12 
## 
##  2 x log-likelihood:  -267.011

The predictor is significant and we get an AIC significantly lower than the poisson model runned with the same predictor.


The residuals are just in the lowerbound exceeding by approximately one unit the -1 to 1 variation range.

## [1] 1.377114

The overdispersion is bit higher than 1, but I think it is still an acceptable number. Gioia said than when it gets closer to 2 that you can accept the overdispersion assumption.

Running negative binomial with the variable representing time plus the age variables as predictors

## 
## Call:
## glm.nb(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + 
##     `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`, init.theta = 9.865590193, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9495  -1.0266  -0.2049   0.6496   1.8060  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     1.584594   0.207628   7.632 2.31e-14 ***
## elapsed_time    0.233197   0.019051  12.241  < 2e-16 ***
## `Edad_19 a 30`  0.008471   0.025858   0.328    0.743    
## `Edad_31 a 45` -0.005270   0.015741  -0.335    0.738    
## `Edad_46 a 60`  0.001302   0.024565   0.053    0.958    
## `Edad_60 a 75` -0.030220   0.031220  -0.968    0.333    
## `Edad_76+`      0.032517   0.097564   0.333    0.739    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(9.8656) family taken to be 1)
## 
##     Null deviance: 431.362  on 24  degrees of freedom
## Residual deviance:  28.505  on 18  degrees of freedom
## AIC: 281.23
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  9.87 
##           Std. Err.:  3.38 
## 
##  2 x log-likelihood:  -265.235

The AIC increased in comparison with the previous model(nb1) and the age variables are not significant. That’s awkward. In the Poisson model they are.

Running negative binomial with the variable representing time plus the departments variables as predictors

## 
## Call:
## glm.nb(formula = y ~ elapsed_time + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + 
##     `Departamento o Distrito_Valle del Cauca`, init.theta = 58.37926959, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6198  -0.8635  -0.2064   0.6331   1.6697  
## 
## Coefficients:
##                                            Estimate Std. Error z value
## (Intercept)                                1.087686   0.181477   5.994
## elapsed_time                               0.290855   0.017550  16.573
## `Departamento o Distrito_Bogotá D.C.`     -0.023248   0.004339  -5.358
## `Departamento o Distrito_Boyacá`          -0.590608   0.153313  -3.852
## `Departamento o Distrito_Caldas`           0.021961   0.093427   0.235
## `Departamento o Distrito_Casanare`        -0.066599   0.215656  -0.309
## `Departamento o Distrito_Cauca`            0.116802   0.053646   2.177
## `Departamento o Distrito_Cundinamarca`     0.146286   0.037137   3.939
## `Departamento o Distrito_Meta`            -0.127833   0.054545  -2.344
## `Departamento o Distrito_Quindío`         -0.077581   0.053768  -1.443
## `Departamento o Distrito_Risaralda`        0.013413   0.065188   0.206
## `Departamento o Distrito_Santander`        0.004579   0.158742   0.029
## `Departamento o Distrito_Tolima`           0.126084   0.104386   1.208
## `Departamento o Distrito_Valle del Cauca` -0.020356   0.014350  -1.418
##                                           Pr(>|z|)    
## (Intercept)                               2.05e-09 ***
## elapsed_time                               < 2e-16 ***
## `Departamento o Distrito_Bogotá D.C.`     8.40e-08 ***
## `Departamento o Distrito_Boyacá`          0.000117 ***
## `Departamento o Distrito_Caldas`          0.814161    
## `Departamento o Distrito_Casanare`        0.757457    
## `Departamento o Distrito_Cauca`           0.029461 *  
## `Departamento o Distrito_Cundinamarca`    8.18e-05 ***
## `Departamento o Distrito_Meta`            0.019096 *  
## `Departamento o Distrito_Quindío`         0.149057    
## `Departamento o Distrito_Risaralda`       0.836983    
## `Departamento o Distrito_Santander`       0.976989    
## `Departamento o Distrito_Tolima`          0.227101    
## `Departamento o Distrito_Valle del Cauca` 0.156049    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(58.3793) family taken to be 1)
## 
##     Null deviance: 1829.773  on 24  degrees of freedom
## Residual deviance:   26.279  on 11  degrees of freedom
## AIC: 259.7
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  58.4 
##           Std. Err.:  23.0 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -229.698

The AIC decreased in comparison with the first model(nb1), but very littlle. In addition, just 5 out of the 12 department dummy variables are significant.

Running negative binomial with the variable representing time plus the continent of origin variables as predictors

## 
## Call:
## glm.nb(formula = y ~ elapsed_time + `Continente de procedencia_Asia` + 
##     `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` + 
##     `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` + 
##     `Continente de procedencia_Sudamerica`, init.theta = 19.98630333, 
##     link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.60748  -0.73987  -0.02103   0.34297   2.28190  
## 
## Coefficients:
##                                            Estimate Std. Error z value
## (Intercept)                                1.133298   0.207234   5.469
## elapsed_time                               0.244959   0.014037  17.451
## `Continente de procedencia_Asia`          -0.013771   0.046720  -0.295
## `Continente de procedencia_Centroamérica` -0.052475   0.030717  -1.708
## `Continente de procedencia_Colombia`      -0.008440   0.004858  -1.737
## `Continente de procedencia_Europa`         0.057318   0.018248   3.141
## `Continente de procedencia_Norteamérica`  -0.002719   0.024493  -0.111
## `Continente de procedencia_Sudamerica`    -0.040410   0.045712  -0.884
##                                           Pr(>|z|)    
## (Intercept)                               4.53e-08 ***
## elapsed_time                               < 2e-16 ***
## `Continente de procedencia_Asia`           0.76817    
## `Continente de procedencia_Centroamérica`  0.08757 .  
## `Continente de procedencia_Colombia`       0.08232 .  
## `Continente de procedencia_Europa`         0.00168 ** 
## `Continente de procedencia_Norteamérica`   0.91162    
## `Continente de procedencia_Sudamerica`     0.37669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(19.9863) family taken to be 1)
## 
##     Null deviance: 793.472  on 24  degrees of freedom
## Residual deviance:  24.861  on 17  degrees of freedom
## AIC: 265.31
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  19.99 
##           Std. Err.:  6.57 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -247.309

The AIC decreased in comparison with the first model(nb1), but very littlle. In addition, just one out of the 6 department dummy variables is significant.

Applying ANOVA to compare the negative binomial models

## Likelihood ratio tests of Negative Binomial Models
## 
## Response: y
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Model
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                               elapsed_time
## 2                                                                                                                                                                                                                                                                                                                                                                              elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`
## 3                                                                                                                                                                                                                elapsed_time + `Continente de procedencia_Asia` + `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` + `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` + `Continente de procedencia_Sudamerica`
## 4 elapsed_time + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + `Departamento o Distrito_Valle del Cauca`
##       theta Resid. df    2 x log-lik.   Test    df  LR stat.      Pr(Chi)
## 1  9.135336        23       -267.0109                                    
## 2  9.865590        18       -265.2346 1 vs 2     5  1.776357 8.791431e-01
## 3 19.986303        17       -247.3089 2 vs 3     1 17.925679 2.297009e-05
## 4 58.379270        11       -229.6979 3 vs 4     6 17.610955 7.281668e-03

We already know the first model is the best. But I did for you to see that ANOVA is working in the Negative Binomial Models. I don’t know why is not working in Poisson.

Preliminary conclusions

In the case of the Poisson, the model with the lower AIC (210.3) includes 3 variables as predictor: time, age and departments

In the case of the Negative binomial, the model with the lower AIC (273.01) includes just the variable time as predictor.